graph LR
subgraph Dense["Dense Model"]
direction TB
D1["All parameters<br/>activated for<br/>every token"]
end
subgraph Sparse["Sparse MoE Model"]
direction TB
S1["Only selected<br/>experts activated<br/>per token"]
end
A["Input<br/>Token"] --> Dense
A --> Sparse
Dense --> D2["High compute<br/>High quality"]
Sparse --> S2["Low compute<br/>Same quality"]
style D1 fill:#e74c3c,color:#fff,stroke:#333
style S1 fill:#27ae60,color:#fff,stroke:#333
style D2 fill:#e74c3c,color:#fff,stroke:#333
style S2 fill:#27ae60,color:#fff,stroke:#333
style A fill:#4a90d9,color:#fff,stroke:#333
Training LLMs with Mixture of Experts
From dense to sparse: understanding MoE architecture, routing strategies, expert specialization, sparse upcycling, and fine-tuning MoE models with PyTorch and Unsloth
Keywords: Mixture of Experts, MoE, sparse model, router, expert specialization, Mixtral, DeepSeekMoE, OLMoE, Switch Transformer, top-k routing, load balancing, sparse upcycling, fine-tuning, Unsloth, LoRA, PyTorch

Introduction
Scaling dense language models has been the dominant recipe for better performance — more parameters and more data. But dense scaling hits practical limits: training costs grow linearly with parameter count, inference latency increases, and memory requirements balloon. Mixture of Experts (MoE) offers an elegant alternative: scale model capacity without proportionally scaling computation.
The key idea is conditional computation — instead of activating all parameters for every input token, a MoE model selects a small subset of “experts” per token. This means a model with 47B total parameters might only use 13B active parameters per token (as in Mixtral 8x7B), achieving the quality of a much larger dense model at the inference cost of a much smaller one.
This article covers MoE architecture from the ground up: how routing works, how to balance load across experts, the design innovations from Mixtral, DeepSeek, and OLMoE, and how to build and fine-tune your own MoE models. All examples target small, practical MoE configurations.
For the full pretraining pipeline (data collection, cleaning, tokenization), see Pre-training LLMs from Scratch. For post-training alignment, see Post-Training LLMs for Human Alignment.
Dense vs Sparse: Why MoE?
| Aspect | Dense Model | MoE Model |
|---|---|---|
| Parameters used per token | All | Subset (e.g., 2 of 8 experts) |
| Training speed | Baseline | 2–4x faster at same quality |
| Inference FLOPs | Proportional to total params | Proportional to active params |
| Memory (inference) | Proportional to total params | All experts must be loaded |
| Example | Llama 3 70B (70B active) | Mixtral 8x7B (13B active / 47B total) |
“Model capacity depends on total parameters, but inference speed depends on active parameters.” — HuggingFace MoE Blog
1. MoE Architecture
In a standard Transformer, each layer has a self-attention block followed by a feed-forward network (FFN). In a MoE Transformer, some or all FFN layers are replaced by MoE layers consisting of:
- Multiple experts — each expert is a standard FFN (same architecture, independently parameterized)
- A router (gating network) — a small learned network that decides which experts process each token
graph TD
A["Input Hidden State<br/>(per token)"] --> B["Router<br/>(Gating Network)"]
B -->|"weight=0.7"| E1["Expert 1<br/>(FFN)"]
B -->|"weight=0.3"| E2["Expert 2<br/>(FFN)"]
B -->|"weight=0.0"| E3["Expert 3<br/>(FFN)"]
B -->|"weight=0.0"| E4["Expert 4<br/>(FFN)"]
E1 --> C["Weighted Sum<br/>of Expert Outputs"]
E2 --> C
C --> D["Output<br/>Hidden State"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style E1 fill:#27ae60,color:#fff,stroke:#333
style E2 fill:#27ae60,color:#fff,stroke:#333
style E3 fill:#95a5a6,color:#fff,stroke:#333
style E4 fill:#95a5a6,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
The router computes a probability distribution over experts for each token. Only the top-k experts (typically k=1 or k=2) are activated, and their outputs are combined with the router weights:
y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x)
where G(x) is the gating function (zero for non-selected experts) and E_i(x) is the output of expert i.
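As a toy illustration of this formula (made-up sizes, with `nn.Linear` standing in for full FFN experts), the weighted combination can be computed directly — experts with a zero gate value are simply never evaluated:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8)                             # one token's hidden state
experts = [nn.Linear(8, 8) for _ in range(4)]  # E_i: four small expert stand-ins
gate = torch.tensor([0.7, 0.3, 0.0, 0.0])      # G(x): zero for unselected experts

# Only the experts with non-zero gate weight actually run
y = sum(g * e(x) for g, e in zip(gate, experts) if g > 0)
```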
Where MoE Layers Are Placed
Not every Transformer layer needs to be a MoE layer. Common patterns:
| Pattern | Description | Used By |
|---|---|---|
| Every layer | All FFN layers replaced with MoE | Mixtral, OLMoE |
| Every other layer | Alternating dense and MoE layers | GShard, GLaM |
| Every N layers | Sparse MoE placement | ST-MoE (every 4th) |
Parameter Counting
For Mixtral 8x7B:
- Total parameters: ~47B (not 8×7B=56B, because attention layers are shared)
- Active parameters per token: ~13B (shared attention + 2 selected experts)
- Each expert: ~5.6B parameters (its FFN weights across all 32 layers)
- Shared layers (attention, embeddings, LM head): ~1.6B parameters
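These figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes Mixtral's published shapes (hidden size 4096, FFN intermediate 14336, 32 layers, grouped-query attention with 8 KV heads, 32k vocabulary) and ignores small terms such as norm weights:

```python
hidden, inter, layers = 4096, 14336, 32
num_experts, top_k, vocab = 8, 2, 32000

expert_ffn = 3 * hidden * inter                 # gate, up, down projections
attn = layers * (2 * hidden * hidden            # q_proj, o_proj
                 + 2 * hidden * (hidden // 4))  # k_proj, v_proj (8 KV heads)
embed = 2 * vocab * hidden                      # embeddings + LM head

total = attn + embed + num_experts * layers * expert_ffn
active = attn + embed + top_k * layers * expert_ffn

print(f"total  = {total / 1e9:.1f}B")   # ~46.7B
print(f"active = {active / 1e9:.1f}B")  # ~12.9B
```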
2. Routing Strategies
The router is the most critical design decision in a MoE. It determines which tokens go to which experts.
graph TD
A{{"Routing<br/>Strategies"}} --> B["Top-K<br/>Routing"]
A --> C["Expert<br/>Choice"]
A --> D["Token<br/>Choice"]
B --> B1["Token picks top-K experts<br/>Most common approach<br/>K=1 (Switch) or K=2 (Mixtral)"]
C --> C1["Expert picks top-K tokens<br/>Better load balance<br/>Used by some research models"]
D --> D1["Soft routing / learned<br/>Differentiable assignment<br/>Emerging approach"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#4a90d9,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style B1 fill:#4a90d9,color:#fff,stroke:#333
style C1 fill:#f5a623,color:#fff,stroke:#333
style D1 fill:#27ae60,color:#fff,stroke:#333
Top-K Token-Choice Routing
The standard approach. Each token selects its top-K experts:
import torch
import torch.nn as nn
import torch.nn.functional as F
class TopKRouter(nn.Module):
    """Standard top-k router for Mixture of Experts."""

    def __init__(self, hidden_dim, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x):
        # x: (batch * seq_len, hidden_dim)
        logits = self.gate(x)  # (batch * seq_len, num_experts)
        scores = F.softmax(logits, dim=-1)
        # Select top-k experts per token
        top_k_scores, top_k_indices = torch.topk(scores, self.top_k, dim=-1)
        # Normalize selected scores to sum to 1
        top_k_scores = top_k_scores / top_k_scores.sum(dim=-1, keepdim=True)
        return top_k_scores, top_k_indices, logits
Noisy Top-K Gating
Adding noise during training helps with load balancing and exploration:
H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i)
G(x) = \text{Softmax}(\text{TopK}(H(x), k))
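A minimal sketch of noisy top-k gating (toy shapes; `w_gate` and `w_noise` correspond to W_g and W_noise in the formula, and non-selected logits are masked to negative infinity so the softmax assigns them zero weight):

```python
import torch
import torch.nn.functional as F

def noisy_top_k_gates(x, w_gate, w_noise, k, training=True):
    clean_logits = x @ w_gate
    if training:
        # Per-expert, input-dependent noise scale (Softplus keeps it positive)
        noise_std = F.softplus(x @ w_noise)
        logits = clean_logits + torch.randn_like(clean_logits) * noise_std
    else:
        logits = clean_logits
    # Keep the top-k logits; mask the rest so softmax gives them zero weight
    top_v, top_i = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf")).scatter(-1, top_i, top_v)
    return F.softmax(masked, dim=-1)

gates = noisy_top_k_gates(torch.randn(4, 16), torch.randn(16, 8),
                          torch.randn(16, 8), k=2)
# Each row sums to 1 and has exactly 2 non-zero entries
```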
Switch Transformer: Top-1 Routing
The Switch Transformer simplified routing by using only one expert per token (K=1):
- Reduces router computation
- Halves the batch size per expert (vs top-2)
- Reduces communication costs
- Quality is preserved
This was counterintuitive — the original assumption was that at least two experts were needed. Switch Transformers showed top-1 can work very well, achieving a 4x pre-train speed-up over T5-XXL.
3. Load Balancing and Training Stability
Without intervention, routers converge to send most tokens to a few “popular” experts, creating a vicious cycle: favored experts train faster, get selected more, and other experts are wasted.
graph TD
A["Unbalanced<br/>Routing"] --> B["Popular experts<br/>get more tokens"]
B --> C["Popular experts<br/>train faster"]
C --> D["Router reinforces<br/>same experts"]
D --> A
E["Auxiliary<br/>Loss"] -->|"breaks the cycle"| A
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#e74c3c,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
Auxiliary Load Balancing Loss
An auxiliary loss encourages uniform expert usage. For each MoE layer, the loss penalizes imbalanced routing:
def load_balancing_loss(router_logits, top_k_indices, num_experts):
    """Compute auxiliary load balancing loss (Switch Transformer style)."""
    # Fraction of tokens routed to each expert
    tokens_per_expert = torch.zeros(num_experts, device=top_k_indices.device)
    for i in range(num_experts):
        tokens_per_expert[i] = (top_k_indices == i).float().sum()
    fraction_tokens = tokens_per_expert / top_k_indices.numel()
    # Average routing probability for each expert
    routing_probs = F.softmax(router_logits, dim=-1)
    fraction_probs = routing_probs.mean(dim=0)
    # Load balancing loss: N * sum(f_i * P_i)
    # Minimized when both distributions are uniform (1/N each)
    loss = num_experts * (fraction_tokens * fraction_probs).sum()
    return loss
The total training loss becomes:
\mathcal{L} = \mathcal{L}_{LM} + \alpha \cdot \mathcal{L}_{aux}
where \alpha is typically a small constant (0.01–0.1).
Router Z-Loss
Introduced by ST-MoE, the router z-loss improves stability without quality degradation by penalizing large logits entering the gating network:
\mathcal{L}_z = \frac{1}{B} \sum_{i=1}^{B} \left(\log \sum_{j=1}^{N} e^{x_j^{(i)}}\right)^2
This reduces roundoff errors in the softmax exponential, which is especially impactful when training in mixed precision.
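In code, the z-loss is a one-liner over the router logits (a sketch; `router_logits` is the pre-softmax `(num_tokens, num_experts)` tensor):

```python
import torch

def router_z_loss(router_logits):
    # log-sum-exp of each token's router logits, squared, averaged over tokens
    z = torch.logsumexp(router_logits, dim=-1)
    return (z ** 2).mean()

loss = router_z_loss(torch.randn(64, 8))
```

Because the loss grows with the magnitude of the logits, minimizing it keeps them small, which is exactly what keeps the softmax exponentials numerically well-behaved.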
Expert Capacity Factor
Expert capacity limits how many tokens one expert can process:
\text{Expert Capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}
Tokens exceeding capacity are “dropped” (passed through via residual connection). Good starting points:
| Capacity Factor | Trade-off |
|---|---|
| 1.0 | Efficient, some tokens dropped |
| 1.25 | Good balance (recommended) |
| 1.5+ | Fewer drops, higher memory/communication |
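Capacity enforcement can be sketched in a few lines (a hypothetical helper; real implementations vectorize this and track capacity per top-k slot, but the first-come-first-served idea is the same — overflow tokens are skipped and pass through via the residual connection):

```python
import torch

def apply_capacity(expert_ids, num_experts, capacity_factor=1.25):
    """Return a boolean mask of tokens kept after per-expert capacity limiting."""
    num_tokens = expert_ids.numel()
    capacity = int(num_tokens / num_experts * capacity_factor)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        # Tokens assigned to expert e, in order of arrival; keep the first `capacity`
        slots = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[slots[:capacity]] = True
    return keep

ids = torch.tensor([0, 0, 0, 1, 1, 2, 3, 0])   # expert 0 is over-subscribed
mask = apply_capacity(ids, num_experts=4, capacity_factor=1.0)
# capacity = 8 / 4 * 1.0 = 2, so the 3rd and 4th tokens sent to expert 0 are dropped
```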
4. Expert Specialization Strategies
Different MoE architectures approach expert design differently. The key innovations come from how experts are structured and organized.
graph TD
A{{"Expert Design<br/>Strategies"}} --> B["Standard MoE<br/>(Mixtral)"]
A --> C["Fine-Grained MoE<br/>(DeepSeekMoE)"]
A --> D["Shared + Routed<br/>(DeepSeek-V2/V3)"]
B --> B1["N large experts<br/>Top-K routing<br/>e.g. 8 experts, top-2"]
C --> C1["mN smaller experts<br/>Top-mK routing<br/>More flexible combinations"]
D --> D1["K_s shared experts<br/>always active +<br/>routed experts"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#4a90d9,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style B1 fill:#4a90d9,color:#fff,stroke:#333
style C1 fill:#f5a623,color:#fff,stroke:#333
style D1 fill:#27ae60,color:#fff,stroke:#333
Mixtral: Standard Top-2 MoE
Mixtral 8x7B uses a straightforward design:
- 8 experts per layer, each a full FFN (same size as Mistral 7B’s FFN)
- Top-2 routing: each token activates exactly 2 experts
- Every layer is a MoE layer
- 32k token context length
- Outperforms Llama 2 70B while using only 13B active parameters
DeepSeekMoE: Fine-Grained Expert Segmentation
DeepSeekMoE introduces two key ideas for better expert specialization:
- Fine-grained segmentation: Instead of N large experts with top-K routing, use mN smaller experts with top-mK routing. More small experts allow more flexible combinations.
- Shared expert isolation: Dedicate K_s experts as “shared experts” that are always active for every token, capturing common knowledge and reducing redundancy in routed experts.
Result: DeepSeekMoE 2B matches GShard 2.9B (which has 1.5x more expert parameters and compute). DeepSeekMoE 16B matches Llama 2 7B quality with only 40% of the compute.
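The shared + routed split can be sketched as follows (a hypothetical toy class, not DeepSeek's actual implementation): shared experts run on every token, while routed experts run only on their assigned tokens.

```python
import torch
import torch.nn as nn

class SharedRoutedMoE(nn.Module):
    """Toy shared + routed MoE layer (DeepSeekMoE-style, heavily simplified)."""

    def __init__(self, dim, num_shared, num_routed, top_k):
        super().__init__()
        self.top_k = top_k

        def make_ffn():
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                                 nn.Linear(4 * dim, dim))

        self.shared = nn.ModuleList(make_ffn() for _ in range(num_shared))
        self.routed = nn.ModuleList(make_ffn() for _ in range(num_routed))
        self.gate = nn.Linear(dim, num_routed, bias=False)

    def forward(self, x):                      # x: (tokens, dim)
        out = sum(e(x) for e in self.shared)   # shared experts: every token
        scores, idx = self.gate(x).softmax(-1).topk(self.top_k, dim=-1)
        scores = scores / scores.sum(-1, keepdim=True)
        for k in range(self.top_k):            # routed experts: selected tokens only
            for e_id in idx[:, k].unique():
                rows = (idx[:, k] == e_id).nonzero(as_tuple=True)[0]
                out[rows] += scores[rows, k, None] * self.routed[e_id](x[rows])
        return out

moe = SharedRoutedMoE(dim=8, num_shared=1, num_routed=4, top_k=2)
out = moe(torch.randn(5, 8))
```

Because the shared experts see every token, the router only has to distribute what is left over, which is what reduces redundancy across the routed experts.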
OLMoE: Fully Open MoE
OLMoE-1B-7B is the most practical open MoE model:
- 7B total parameters, 1B active per token
- 64 experts per layer, top-8 routing
- Pretrained on 5 trillion tokens
- Fully open: weights, training data, code, and logs
- Outperforms models with similar active params, even larger ones like Llama2-13B-Chat
5. Notable MoE Models Comparison
| Model | Total Params | Active Params | Experts | Top-K | Key Innovation |
|---|---|---|---|---|---|
| Switch Transformer | 1.6T | varies | 2048 | 1 | Simplified routing |
| Mixtral 8x7B | 47B | 13B | 8 | 2 | Strong open MoE |
| DeepSeekMoE 16B | 16B | 2.8B | 64 | 6 | Fine-grained + shared experts |
| DeepSeek-V2 | 236B | 21B | 160 | 6 | MLA + DeepSeekMoE |
| DeepSeek-V3 | 671B | 37B | 256 | 8 | Multi-Token Prediction |
| OLMoE-1B-7B | 7B | 1B | 64 | 8 | Fully open, small scale |
| Qwen3-30B-A3B | 30B | 3B | 128 | 8 | Thinking + non-thinking |
| gpt-oss-20b | 21B | 3.6B | 32 | 4 | OpenAI’s open MoE |
6. Building a MoE Layer from Scratch
Here’s a complete, minimal MoE implementation in PyTorch:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Single expert: a standard FFN with SwiGLU activation."""

    def __init__(self, hidden_dim, intermediate_dim):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class MoELayer(nn.Module):
    """Mixture of Experts layer with top-k routing."""

    def __init__(self, hidden_dim, intermediate_dim, num_experts, top_k=2,
                 aux_loss_coeff=0.01):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.aux_loss_coeff = aux_loss_coeff
        # Router
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Experts
        self.experts = nn.ModuleList([
            Expert(hidden_dim, intermediate_dim)
            for _ in range(num_experts)
        ])

    def forward(self, x):
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)  # (B*S, D)
        # Compute routing scores
        logits = self.gate(x_flat)  # (B*S, num_experts)
        scores = F.softmax(logits, dim=-1)
        # Select top-k experts
        top_k_scores, top_k_indices = torch.topk(
            scores, self.top_k, dim=-1
        )
        top_k_scores = top_k_scores / top_k_scores.sum(dim=-1, keepdim=True)
        # Compute expert outputs
        output = torch.zeros_like(x_flat)
        for i, expert in enumerate(self.experts):
            # Find tokens routed to this expert
            mask = (top_k_indices == i).any(dim=-1)  # (B*S,)
            if not mask.any():
                continue
            token_indices = mask.nonzero(as_tuple=True)[0]
            expert_input = x_flat[token_indices]
            expert_output = expert(expert_input)
            # Weight by routing score
            for k in range(self.top_k):
                k_mask = top_k_indices[token_indices, k] == i
                if k_mask.any():
                    weight = top_k_scores[token_indices[k_mask], k]
                    output[token_indices[k_mask]] += (
                        weight.unsqueeze(-1) * expert_output[k_mask]
                    )
        # Auxiliary load balancing loss
        self.aux_loss = self._load_balancing_loss(logits, top_k_indices)
        return output.view(batch_size, seq_len, hidden_dim)

    def _load_balancing_loss(self, logits, top_k_indices):
        num_tokens = logits.shape[0]
        # Fraction of tokens per expert
        tokens_per_expert = torch.zeros(
            self.num_experts, device=logits.device
        )
        for i in range(self.num_experts):
            tokens_per_expert[i] = (top_k_indices == i).float().sum()
        f = tokens_per_expert / (num_tokens * self.top_k)
        # Mean routing probability per expert
        p = F.softmax(logits, dim=-1).mean(dim=0)
        return self.aux_loss_coeff * self.num_experts * (f * p).sum()
Plugging MoE into a Transformer
class MoETransformerBlock(nn.Module):
    """Transformer block with MoE FFN."""

    def __init__(self, hidden_dim, num_heads, intermediate_dim,
                 num_experts, top_k):
        super().__init__()
        self.attn_norm = nn.RMSNorm(hidden_dim)
        self.attention = nn.MultiheadAttention(
            hidden_dim, num_heads, batch_first=True
        )
        self.ffn_norm = nn.RMSNorm(hidden_dim)
        self.moe = MoELayer(
            hidden_dim, intermediate_dim, num_experts, top_k
        )

    def forward(self, x):
        # Self-attention (shared across all tokens)
        h = self.attn_norm(x)
        h, _ = self.attention(h, h, h)
        x = x + h
        # MoE FFN (sparse per token)
        h = self.ffn_norm(x)
        h = self.moe(h)
        x = x + h
        return x
Small MoE Configuration
For a practical small MoE model (~1.5B active, ~5B total):
config = {
    "hidden_dim": 2048,
    "intermediate_dim": 5632,  # per expert
    "num_layers": 16,
    "num_heads": 16,
    "num_experts": 8,
    "top_k": 2,
    "vocab_size": 32000,
    "max_seq_length": 2048,
}

# Parameter count estimate:
# Shared (attention + embeddings): ~0.4B
# Per-expert FFN: 3 x 2048 x 5632 ≈ 35M; x 8 experts x 16 layers ≈ 4.4B
# Total: ~4.8B; Active: ~1.5B (shared + 2 of 8 experts per layer)
7. Sparse Upcycling: Dense to MoE
Training a MoE from scratch is expensive. Sparse upcycling offers a practical shortcut: initialize a MoE from a pre-trained dense model checkpoint.
graph LR
A["Dense Model<br/>(pre-trained)"] --> B["Copy FFN weights<br/>to all experts"]
B --> C["Add random<br/>router weights"]
C --> D["Continue training<br/>as MoE"]
D --> E["MoE Model<br/>(better quality)"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
The process:
- Take a pre-trained dense model
- Copy the FFN weights to initialize every expert (all experts start identical)
- Add randomly initialized router weights
- Continue pretraining — experts will naturally diverge and specialize
Sparse upcycling achieves the quality of training from scratch while using only ~50% of the original dense training compute.
Upcycling Implementation
import torch
import torch.nn as nn

def upcycle_dense_to_moe(dense_model, num_experts=8, top_k=2):
    """Convert a dense transformer to MoE by duplicating FFN layers."""
    for layer_idx, layer in enumerate(dense_model.layers):
        # Get the original dense FFN
        original_ffn = layer.ffn
        hidden_dim = original_ffn.gate_proj.in_features
        intermediate_dim = original_ffn.gate_proj.out_features
        # Create MoE layer
        moe = MoELayer(
            hidden_dim=hidden_dim,
            intermediate_dim=intermediate_dim,
            num_experts=num_experts,
            top_k=top_k,
        )
        # Copy dense FFN weights to ALL experts
        for expert in moe.experts:
            expert.load_state_dict(original_ffn.state_dict())
        # Router starts with small random weights
        nn.init.xavier_uniform_(moe.gate.weight, gain=0.01)
        # Replace dense FFN with MoE
        layer.ffn = moe
    return dense_model
OLMoE's Upcycling Recipe
OLMoE provides a complete open-source upcycling pipeline starting from the OLMo 1B dense checkpoint:
# 1. Clone OLMoE repo
git clone https://github.com/allenai/OLMo.git -b Muennighoff/MoE
cd OLMo && pip install -e .
# 2. Install megablocks for efficient MoE training
pip install git+https://github.com/Muennighoff/megablocks.git@olmoe
# 3. Download dense checkpoint and convert to MoE
# (script duplicates FFN to 8 experts, adds router)
python scripts/sparsify_ckpt_unsharded.py
# 4. Continue training with MoE config
python scripts/train.py configs/OLMoE-1B-7B-0924.yml \
--load_path=path_to_upcycled_ckpt \
--reset_optimizer_state=True \
--reset_trainer_state=True
8. Training a MoE from Scratch
For full control, you can train a MoE from random initialization. The training loop adds the auxiliary loss:
import torch
from torch.utils.data import DataLoader

# Model setup
model = MoETransformerModel(config)  # your MoE model
model.to("cuda")
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Training loop
for step, batch in enumerate(dataloader):
    input_ids = batch["input_ids"].to("cuda")
    labels = batch["labels"].to("cuda")
    # Forward pass
    outputs = model(input_ids=input_ids, labels=labels)
    lm_loss = outputs.loss
    # Collect auxiliary losses from all MoE layers
    aux_loss = sum(
        layer.moe.aux_loss
        for layer in model.layers
        if hasattr(layer, "moe")
    )
    # Total loss
    total_loss = lm_loss + aux_loss
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    optimizer.zero_grad()
    if step % 100 == 0:
        print(
            f"Step {step} | LM Loss: {lm_loss.item():.4f} | "
            f"Aux Loss: {aux_loss.item():.4f}"
        )
Training Hyperparameters for Small MoE Models
Based on OLMoE and DeepSeekMoE recipes:
| Hyperparameter | OLMoE-1B-7B | DeepSeekMoE-16B | Mixtral 8x7B |
|---|---|---|---|
| Total params | 7B | 16B | 47B |
| Active params | 1B | 2.8B | 13B |
| Experts | 64 | 64 | 8 |
| Top-K | 8 | 6 | 2 |
| Learning rate | 3e-4 | 4.2e-4 | ~2e-4 |
| Batch size (tokens) | 4M | 4.5M | ~4M |
| Optimizer | AdamW | AdamW | AdamW |
| Training tokens | 5T | 2T | undisclosed |
| Aux loss weight | 0.01 | 0.01 | 0.01 |
| Context length | 2048 → 4096 | 4096 | 32768 |
Training Stability Tips
MoE training is less stable than dense training. Key practices:
- Use router z-loss — penalizes large router logits, prevents instability from exponentials
- Selective precision — keep router computation in full precision (fp32), even when experts use bf16
- Warmup — longer warmup helps stabilize routing (2000–5000 steps)
- Monitor expert utilization — if any expert consistently gets <1% of tokens, routing has collapsed
- Auxiliary loss coefficient — start with 0.01, increase if load is very unbalanced
- Don’t fine-tune the router — during LoRA fine-tuning, Unsloth disables router updates by default
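The utilization check above is cheap to implement from the router's top-k indices (a sketch):

```python
import torch

def expert_load(top_k_indices, num_experts):
    """Fraction of routed token slots handled by each expert."""
    counts = torch.bincount(top_k_indices.flatten(), minlength=num_experts)
    return counts.float() / counts.sum()

idx = torch.tensor([[0, 1], [0, 1], [2, 3]])   # toy routing decisions, top-2
load = expert_load(idx, num_experts=4)
collapsed = (load < 0.01).nonzero(as_tuple=True)[0]
if len(collapsed) > 0:
    print(f"Warning: experts {collapsed.tolist()} receive <1% of tokens")
```

Logging this vector every few hundred steps makes routing collapse visible long before it shows up in the loss curve.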
9. Fine-tuning MoE Models with Unsloth
Unsloth provides optimized MoE training with custom Triton kernels, achieving ~12x faster training and >35% VRAM reduction compared to standard implementations.
graph TD
A["Pre-trained MoE<br/>(e.g. Qwen3-30B-A3B)"] --> B["Add LoRA adapters<br/>to expert layers"]
B --> C["Fine-tune with<br/>Unsloth + TRL"]
C --> D["Merge & Export<br/>(GGUF / safetensors)"]
D --> E["Deploy with<br/>vLLM / Ollama"]
style A fill:#4a90d9,color:#fff,stroke:#333
style B fill:#f5a623,color:#fff,stroke:#333
style C fill:#e74c3c,color:#fff,stroke:#333
style D fill:#9b59b6,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
Setting Up MoE Fine-tuning
from unsloth import FastLanguageModel

# Load a MoE model (bf16 — QLoRA not supported for MoE yet)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-30B-A3B",
    max_seq_length=4096,
    load_in_4bit=False,  # MoE nn.Parameter doesn't support bnb 4bit yet
)

# Add LoRA adapters to MoE expert layers
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_up_proj", "down_proj",  # LoRA on MoE expert layers
    ],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
Unsloth's Split LoRA for MoE
Unsloth avoids materializing the full LoRA delta for all experts. Instead of the standard approach:
\Delta = A \cdot B \quad \text{(materialized for all E experts)}
Unsloth computes:
Y = X \cdot A \quad \text{(only for routed tokens)} \rightarrow Z = Y \cdot B
This reordering (enabled by matrix multiplication associativity) reduces memory from O(E \cdot m \cdot n) to O(k \cdot s \cdot (r + n)), where E is total experts, k is active experts, s is sequence length, and r is LoRA rank.
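The reordering is easy to verify numerically (toy shapes below; the point is that `(X @ A) @ B` performs two thin matmuls and never materializes the m x n delta):

```python
import torch

torch.manual_seed(0)
m, n, r, s = 64, 64, 8, 16      # hidden dims, LoRA rank, routed tokens (toy)
X = torch.randn(s, m)           # routed tokens
A = torch.randn(m, r)           # LoRA A
B = torch.randn(r, n)           # LoRA B

full = X @ (A @ B)              # materializes the m x n delta matrix
split = (X @ A) @ B             # associativity: same result, no m x n intermediate
assert torch.allclose(full, split, atol=1e-3)
```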
Training the MoE
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# Prepare instruction dataset
dataset = load_dataset("your_dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="qwen3-moe-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        max_seq_length=4096,
        warmup_steps=100,
        bf16=True,
        logging_steps=10,
        save_steps=500,
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
    ),
    dataset_text_field="text",
)
trainer.train()
Choosing the MoE Backend
Unsloth auto-selects the optimal MoE backend, but you can override:
import os

# Options: "grouped_mm" (default), "unsloth_triton", "native_torch"
os.environ["UNSLOTH_MOE_BACKEND"] = "grouped_mm"
| Backend | Speed | Compatibility | Notes |
|---|---|---|---|
| grouped_mm | Fast | T4+ (PyTorch 2.4+) | Default, good balance |
| unsloth_triton | Fastest on A100 | A100/H100 | ~2.5x faster than grouped_mm on A100 |
| native_torch | Slow | All hardware | For-loop fallback |
Exporting the Fine-tuned MoE
# Save merged model
model.save_pretrained_merged(
    "qwen3-moe-merged",
    tokenizer,
    save_method="merged_16bit",
)

# Export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf(
    "qwen3-moe-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)
For serving with Ollama or llama.cpp, see Run LLM locally with Ollama and Deploying and Serving LLM with Llama.cpp.
10. MoE Fine-tuning Dynamics
MoE models have unique fine-tuning characteristics compared to dense models:
graph TD
A{{"Fine-tuning<br/>Considerations"}} --> B["Overfitting"]
A --> C["Expert Freezing"]
A --> D["Instruction Tuning"]
B --> B1["MoE overfits more easily<br/>Use higher dropout<br/>Smaller batch, higher LR"]
C --> C1["Freeze MoE layers<br/>Update only shared layers<br/>~Same quality, faster training"]
D --> D1["MoE benefits MORE from<br/>instruction tuning than dense<br/>Flan-MoE >> MoE"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#4a90d9,color:#fff,stroke:#333
style C fill:#f5a623,color:#fff,stroke:#333
style D fill:#27ae60,color:#fff,stroke:#333
style B1 fill:#4a90d9,color:#fff,stroke:#333
style C1 fill:#f5a623,color:#fff,stroke:#333
style D1 fill:#27ae60,color:#fff,stroke:#333
Key Findings from Research
| Finding | Dense | MoE |
|---|---|---|
| Overfitting risk | Lower | Higher (more params) |
| Optimal batch size | Larger | Smaller |
| Optimal learning rate | Lower | Higher |
| Instruction tuning benefit | Good | Even better |
| Auxiliary loss at fine-tuning | N/A | Can turn off (acts as regularization) |
| Freezing non-expert layers | Hurts quality | Works ~as well as full fine-tuning |
| Knowledge tasks (TriviaQA) | Good | MoE excels disproportionately |
| Reasoning tasks (SuperGLUE) | Better | MoE struggles more |
Practical OLMoE Fine-tuning
OLMoE provides a complete adaptation pipeline with SFT → DPO:
# SFT (Supervised Fine-Tuning)
accelerate launch \
--mixed_precision bf16 \
--num_processes 8 \
--use_deepspeed \
open_instruct/finetune.py \
--model_name_or_path allenai/OLMoE-1B-7B-0924 \
--use_flash_attn \
--max_seq_length 4096 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--learning_rate 2e-5 \
--num_train_epochs 2 \
--output_dir output/olmoe-sft
# DPO (Direct Preference Optimization)
accelerate launch \
--mixed_precision bf16 \
--num_processes 8 \
--use_deepspeed \
open_instruct/dpo_tune.py \
--model_name_or_path allenai/OLMoE-1B-7B-0924-SFT \
--dataset_name argilla/ultrafeedback-binarized-preferences-cleaned \
--max_seq_length 4096 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--learning_rate 5e-7 \
--num_train_epochs 3 \
--dpo_beta 0.1
Comparison: When to Use MoE vs Dense
| Scenario | Recommendation | Why |
|---|---|---|
| High throughput serving | MoE | Lower per-token compute |
| Limited VRAM | Dense | MoE loads all experts in memory |
| Fixed training budget | MoE | Better quality per FLOP |
| Small fine-tuning dataset | Dense | MoE overfits more easily |
| Knowledge-heavy tasks | MoE | Experts store more knowledge |
| Reasoning-heavy tasks | Dense | Dense generalizes better |
| Single consumer GPU | Dense (or small MoE) | MoE VRAM is high |
| Multi-GPU cluster | MoE | Expert parallelism shines |
Practical Recommendations
For single consumer GPU (16–24 GB):
- Use a small MoE like OLMoE-1B-7B (fits in ~16GB with quantization)
- Fine-tune with Unsloth LoRA on expert layers
- Export to GGUF and serve with Ollama or Llama.cpp
For multi-GPU setup (4–8 GPUs):
- Start with sparse upcycling from a dense checkpoint (see Pre-training LLMs from Scratch)
- Use the OLMoE or megablocks training pipeline
- Fine-tune with SFT → DPO for instruction following
- Deploy with vLLM using expert parallelism
Conclusion
Mixture of Experts is the architecture behind the most capable modern LLMs — from DeepSeek-V3/R1 to GPT-4 to Qwen3. The core insight is simple: make models bigger without making them slower by only activating a subset of parameters per token.
Key takeaways:
- MoE replaces dense FFN layers with multiple expert FFNs and a learned router
- Routing is critical — top-k token-choice with load balancing loss is the standard
- Expert specialization improves with fine-grained segmentation (DeepSeekMoE) and shared experts
- Sparse upcycling converts a dense model to MoE using ~50% of original training compute
- Fine-tuning MoE benefits from smaller batches, higher learning rates, and instruction tuning
- Unsloth provides optimized MoE training with ~12x speedup using Split LoRA and Triton kernels
The tools are mature and open source. OLMoE provides a fully reproducible recipe for training a competitive MoE from scratch, and Unsloth makes fine-tuning accessible on consumer hardware.
For the complete pretraining pipeline, see Pre-training LLMs from Scratch. For alignment techniques, see Post-Training LLMs for Human Alignment. For reasoning training, see Training LLMs for Reasoning.
References
- Jiang et al., Mixtral of Experts, 2024. arXiv:2401.04088
- Dai et al., DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models, 2024. arXiv:2401.06066
- DeepSeek-AI, DeepSeek-V2: A Strong, Economical, and Efficient MoE Language Model, 2024. arXiv:2405.04434
- Muennighoff et al., OLMoE: Open Mixture-of-Experts Language Models, 2024. arXiv:2409.02060
- Fedus et al., Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2022. arXiv:2101.03961
- Komatsuzaki et al., Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints, 2022. arXiv:2212.05055
- Zoph et al., ST-MoE: Designing Stable and Transferable Sparse Expert Models, 2022. arXiv:2202.08906
- Sanseviero et al., Mixture of Experts Explained, HuggingFace Blog, 2023. Blog
- Gosthipaty et al., Mixture of Experts (MoEs) in Transformers, HuggingFace Blog, 2026. Blog
- Unsloth Team, Fine-tune MoE Models 12x Faster, 2026. Docs
Read More
- Try OLMoE-1B-7B-Instruct — smallest open MoE that rivals much larger models
- Explore Unsloth MoE notebooks for hands-on MoE fine-tuning
- Read the OLMoE paper for the most detailed open MoE training recipe
- Check megablocks for efficient sparse MoE GPU kernels
- Experiment with sparse upcycling on your own dense model using the OLMoE codebase